
Domain decomposition and halo construction#540

Open
halungge wants to merge 304 commits into main from halo_construction

Conversation

@halungge
Contributor

@halungge halungge commented Sep 6, 2024

Decompose (global) grid file:

  • uses pymetis to decompose the global grid (cells) into n patches
  • after decomposition, halos for all dimensions (cell, edge, vertex) are constructed. Halo construction is done in an ICON-like fashion: halos consist of 2 cell levels (one upward- and one downward-pointing line of cells) and the corresponding vertices and edges on these lines.

Omissions:

  • LAM grids need to be investigated further:

    • tests comparing decomposed vs. single-node computation are only run on the global grid.
    • for LAM grids, ICON reorders arrays to arrange the halo points of the first boundary layers together with the boundary layers; it should be investigated whether that is essential in the model.
    • this PR only takes this into account in the computation of the start_index and end_index, not in the halo construction.
  • the number of halo lines (in terms of cells) is hardcoded to 2; that could be made a parameter.

  • It is not certain that everything runs correctly on GPU; most probably there are some numpy/cupy issues to fix.
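The two halo levels described above can be computed by repeated neighborhood expansion over the cell-to-cell connectivity. Below is a minimal, hypothetical sketch of that idea, not the PR's actual code: in practice the owned cells would come from pymetis.part_graph, and the edge/vertex halos would be derived from these cell lines.

```python
import numpy as np


def halo_lines(c2c: np.ndarray, owned: np.ndarray, num_lines: int = 2) -> list[np.ndarray]:
    """Compute halo cell lines by repeated neighborhood expansion.

    c2c:   (num_cells, k) cell-to-cell connectivity, -1 marking missing neighbors.
    owned: global indices of cells owned by this rank (from the partitioner).
    Returns one array of global cell indices per halo line.
    """
    known = set(int(c) for c in owned)
    frontier = set(known)
    lines = []
    for _ in range(num_lines):
        # All yet-unseen neighbors of the current frontier form the next halo line.
        next_line = set()
        for cell in frontier:
            for neighbor in c2c[cell]:
                if neighbor >= 0 and int(neighbor) not in known:
                    next_line.add(int(neighbor))
        lines.append(np.array(sorted(next_line), dtype=int))
        known |= next_line
        frontier = next_line
    return lines


# Toy example: a 1D chain of 5 cells, rank owns cells 0..2.
c2c = np.array([[-1, 1], [0, 2], [1, 3], [2, 4], [3, -1]])
print(halo_lines(c2c, np.array([0, 1, 2])))  # [array([3]), array([4])]
```

On a real triangular grid, the first expansion yields the first halo line and the second expansion the second line, matching the two hardcoded halo lines mentioned above.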

Magdalena Luz added 13 commits July 10, 2025 16:54
# Conflicts:
#	model/common/pyproject.toml
# Conflicts:
#	model/common/src/icon4py/model/common/grid/base.py
#	model/common/src/icon4py/model/common/grid/grid_manager.py
#	model/common/src/icon4py/model/common/grid/simple.py
#	model/common/tests/decomposition_tests/mpi_tests/test_mpi_decomposition.py
@halungge halungge force-pushed the halo_construction branch from df9c2ef to 72d4d4b on July 25, 2025 16:30
Magdalena Luz added 11 commits July 29, 2025 08:50
# Conflicts:
#	model/atmosphere/dycore/tests/dycore_tests/mpi_tests/conftest.py
#	model/atmosphere/subgrid_scale_physics/muphys/tests/muphys/fixtures.py
#	model/common/pyproject.toml
#	model/common/src/icon4py/model/common/decomposition/definitions.py
#	model/common/src/icon4py/model/common/grid/grid_manager.py
#	model/common/tests/common/decomposition/mpi_tests/test_mpi_decomposition.py
#	model/common/tests/common/decomposition/unit_tests/test_definitions.py
#	model/common/tests/common/grid/mpi_tests/test_parallel_icon.py
#	model/common/tests/common/io/unit_tests/test_io.py
#	model/common/tests/decomposition_tests/__init__.py
#	model/common/tests/decomposition_tests/mpi_tests/test_mpi_decomposition.py
#	model/common/tests/decomposition_tests/test_mpi_decomposition.py
#	model/testing/src/icon4py/model/testing/datatest_utils.py
#	model/testing/src/icon4py/model/testing/fixtures/datatest.py
#	model/testing/src/icon4py/model/testing/grid_utils.py
#	model/testing/src/icon4py/model/testing/parallel_helpers.py
Comment on lines +194 to +204
# TODO(msimberg): What should we do about this. (The global) num_cells is
# not guaranteed to be set here when used through fortran. Should we:
# 1. Ignore distributed?
# 2. Compute num_cells with a reduction?
# 3. Use a ProcessProperties to detect it?
distributed = (
    config.num_cells < global_properties.num_cells
    if global_properties.num_cells is not None
    else False
)
limited_area_or_distributed = config.limited_area or distributed

This would be good to resolve before this PR is merged.

Comment on lines +129 to +136
# TODO(msimberg): Is halo always expected to be populated?
global_indices_local_field = decomposition_info.global_index(
    dim,
    decomp_defs.DecompositionInfo.EntryType.OWNED,  # ALL if checking halos
)
local_indices_local_field = decomposition_info.local_index(
    dim,
    decomp_defs.DecompositionInfo.EntryType.OWNED,  # ALL if checking halos

This should still be fixed. I'm thinking of adding a check_halos parameter to the test. For fields that have the halo populated, check that the halo has been correctly populated.

Comment on lines +175 to +179
# TODO(msimberg): Is this true? Not true for RBF interpolation... why?
# We expect an exact match, since the starting point is the same (grid
# file) and we are doing the exact same computations in single rank and
# multi rank mode.
np.testing.assert_allclose(sorted_, global_reference_field, atol=1e-9, verbose=True)

Fix or remove todo before merging.

My first guess is that neighbors may be ordered differently on the local patch compared to the global grid, but can that actually happen? If it can, I can see the Cholesky decomposition leading to different results. If not, I would expect exact results.
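For context on why a reordered neighbor list alone could break exactness: floating-point addition is not associative, so summing the same values in a different order can change the low-order bits of the result. A generic illustration, unrelated to the actual RBF code:

```python
import numpy as np

rng = np.random.default_rng(42)
vals = rng.normal(size=1000).astype(np.float32)

# Accumulate the same values in two different orders in float32.
s_fwd = np.float32(0.0)
for v in vals:
    s_fwd += v

s_perm = np.float32(0.0)
for v in vals[rng.permutation(vals.size)]:
    s_perm += v

# The two sums agree to high precision but need not be bit-identical,
# which is why comparisons like assert_allclose use a tolerance.
diff = abs(float(s_fwd) - float(s_perm))
assert diff < 1e-3
print(diff)
```

If neighbor ordering does differ between the local patch and the global grid, the same effect would propagate through any reduction in the interpolation setup, making the atol in assert_allclose necessary rather than an exact comparison.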

    attrs_name: str,
    dim: gtx.Dimension,
) -> None:
    # TODO(msimberg): Currently segfaults. Are topography and vertical fields

@jcanton this is the test that segfaults. I've tried to set up a dummy vertical config and topography, but I may have misunderstood what needs to/can go in them. Does what I've added make sense or do you see any obvious mistakes? With embedded I get some non-segfault error messages that may be helpful to debug. I may just have set up the vertical levels incorrectly?


It seems like this was just a matter of having a consistent num_levels. The grid managers used a default of 10 while the configs manually constructed for this test used experiment.num_levels. So the original segfault is fixed.

However, I'd still appreciate knowledgeable eyes on whether the vertical config and topography that I added make sense for this test.

@msimberg
Contributor

cscs-ci run default

@msimberg
Contributor

cscs-ci run distributed

@nfarabullini nfarabullini left a comment

Mostly minor things for now as I understand that more work will be done.

Great job so far!!

Comment on lines +445 to +450
else:
    return IconLikeHaloConstructor(
        run_properties=run_properties,
        connectivities=connectivities,
        allocator=allocator,
    )

Suggested change
-else:
-    return IconLikeHaloConstructor(
-        run_properties=run_properties,
-        connectivities=connectivities,
-        allocator=allocator,
-    )
+return IconLikeHaloConstructor(
+    run_properties=run_properties,
+    connectivities=connectivities,
+    allocator=allocator,
+)

the else is not needed here


I slightly prefer the explicit else, but don't feel strongly enough to oppose removing it either. Do you prefer it without the else?


Generally I do, because it looks cleaner to me, but I also don't have particularly strong feelings.

# 3. Use a ProcessProperties to detect it?
distributed = (
    config.num_cells < global_properties.num_cells
    if global_properties.num_cells is not None

Suggested change
-    if global_properties.num_cells is not None
+    if global_properties.num_cells


This whole block is a TODO; ideally the None wouldn't even be possible. As with the other None, I still like the explicit comparison to None because 0 is falsy, though 0 would of course not be a useful value for num_cells.

Hopefully this whole thing is removed.

@github-actions

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

